Technical  Research  Report 


Split  Recursive  Least  Squares:  Algorithms, 
Architectures,  and  Applications 


by  A-Y.  Wu  and  K.J.R.  Liu 


T.R.  94-37 


Sponsored  by 

the  National  Science  Foundation 
Engineering  Research  Center  Program, 
the  University  of  Maryland, 

Harvard  University, 
and  Industry 


Report  Documentation  Page 

Form  Approved 

OMB  No.  0704-0188 

Public  reporting  burden  for  the  collection  of  information  is  estimated  to  average  1  hour  per  response,  including  the  time  for  reviewing  instructions,  searching  existing  data  sources,  gathering  and 
maintaining  the  data  needed,  and  completing  and  reviewing  the  collection  of  information.  Send  comments  regarding  this  burden  estimate  or  any  other  aspect  of  this  collection  of  information, 
including  suggestions  for  reducing  this  burden,  to  Washington  Headquarters  Services,  Directorate  for  Information  Operations  and  Reports,  1215  Jefferson  Davis  Highway,  Suite  1204,  Arlington 

VA  22202-4302.  Respondents  should  be  aware  that  notwithstanding  any  other  provision  of  law,  no  person  shall  be  subject  to  a  penalty  for  failing  to  comply  with  a  collection  of  information  if  it 
does  not  display  a  currently  valid  OMB  control  number. 

1.  REPORT  DATE 

1994 

2.  REPORT  TYPE 

3.  DATES  COVERED 

00-00-1994  to  00-00-1994 

4.  TITLE  AND  SUBTITLE 

5a.  CONTRACT  NUMBER 

Split  Recursive  Least  Squares:  Algorithms,  Architectures,  and 

a  — 

5b.  GRANT  NUMBER 

/ippilLiUlUllb 

5c.  PROGRAM  ELEMENT  NUMBER 

6.  AUTHOR(S) 

5d.  PROJECT  NUMBER 

5e.  TASK  NUMBER 

5f.  WORK  UNIT  NUMBER 

7.  PERFORMING  ORGANIZATION  NAME(S)  AND  ADDRESS(ES) 

Department  of  Electrical  Engineering, Institute  for  Systems 

Research, University  of  Maryland, College  Park, MD, 20742 

8.  PERFORMING  ORGANIZATION 

REPORT  NUMBER 

9.  SPONSORING/MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 

10.  SPONSOR/MONITOR'S  ACRONYM(S) 

11.  SPONSOR/MONITOR'S  REPORT 
NUMBER(S) 

12.  DISTRIBUTION/AVAILABILITY  STATEMENT 

Approved  for  public  release;  distribution  unlimited 

13.  SUPPLEMENTARY  NOTES 

14.  ABSTRACT 

see  report 

15.  SUBJECT  TERMS 

16.  SECURITY  CLASSIFICATION  OF: 

17.  LIMITATION  OF 
ABSTRACT 

18.  NUMBER 
OF  PAGES 

33 

19a.  NAME  OF 
RESPONSIBLE  PERSON 

a.  REPORT 

unclassified 

b.  ABSTRACT 

unclassified 

c.  THIS  PAGE 

unclassified 

Standard  Form  298  (Rev.  8-98) 

Prescribed  by  ANSI  Std  Z39-18 


Split  Recursive  Least  Squares:  Algorithms,  Architectures,  and 

Applications 


An- Yeu  Wu  and  K.  J.  Ray  Liu 


Electrical  Engineering  Department  and  Institute  for  Systems  Research 

University  of  Maryland 
College  Park,  MD  20742 
Phone:  (301)  405-6619,  Fax:  (301)  405-6707 


ABSTRACT 

In  this  paper,  a  new  computationally  efficient  algorithm  for  recursive  least-squares  (RLS)  fil¬ 
tering  is  presented.  The  proposed  Split  RLS  algorithm  can  perform  the  approximated  RLS  with 
O(N)  complexity  for  signals  having  no  special  data  structure  to  be  exploited,  while  avoiding  the 
high  computational  complexity  (0(N2))  required  in  the  conventional  RLS  algorithms.  Our  per¬ 
formance  analysis  shows  that  the  estimation  bias  will  be  small  when  the  input  data  are  less  cor¬ 
related.  We  also  show  that  for  highly  correlated  data,  the  orthogonal  preprocessing  scheme  can 
be  used  to  improve  the  performance  of  the  Split  RLS.  Furthermore,  the  systolic  implementation 
of  our  algorithm  based  on  the  QR-decomposition  RLS  (QRD-RLS)  array  as  well  as  its  application 
to  multidimensional  adaptive  filtering  is  also  discussed.  The  hardware  complexity  for  the  resulting 
array  is  only  0(N )  and  the  system  latency  can  be  reduced  to  0(log2JV).  The  simulation  results 
show  that  the  Split  RLS  outperforms  the  conventional  RLS  in  the  application  of  image  restoration. 
A  major  advantage  of  the  Split  RLS  is  its  superior  tracking  capability  over  the  conventional  RLS 
under  non-stationary  environments. 


The  work  is  supported  in  part  by  the  ONR  grant  N00014-93-10566  and  the  NSF  grant 
MIP93-09506. 


1  Introduction 


The  family  of  recursive  least-squares  (RLS)  adaptive  algorithms  are  well  known  for  their  superiority 
to  the  LMS-type  algorithms  in  both  convergence  rate  and  misadjustment  [1][2].  In  general,  the  RLS 
algorithms  do  not  impose  any  restrictions  on  the  input  data  structure.  As  a  consequence  of  this 
generality,  the  computational  complexity  is  0(N 2)  per  time  iteration,  where  N  is  the  size  of  the 
data  matrix.  This  becomes  the  major  drawback  for  their  applications  as  well  as  for  their  cost- 
effective  implementation.  To  alleviate  the  computational  burden  of  the  RLS,  the  family  of  fast 
RLS  algorithms  such  as  fast  transversal  filters,  RLS  lattice  filters,  and  QR-decomposition  based 
lattice  filters  (QRD-LSL),  have  been  proposed  [2].  By  exploiting  the  special  structure  of  the  input 
data  matrix,  they  can  perform  RLS  estimation  with  O(N)  complexity.  One  major  disadvantage 
of  the  fast  RLS  algorithms  is  that  they  work  for  data  with  shifting  input  only  ( e.g .,  Toeplitz  or 
Hankel  data  matrix).  In  many  applications  like  multichannel  adaptive  array  processing  and  image 
processing,  the  fast  RLS  algorithms  cannot  be  applied  because  no  special  matrix  structure  can  be 
exploited.  In  this  paper,  we  propose  an  approximated  RLS  algorithm,  which  is  called  the  Split  RLS 
,  based  on  the  projection  method.  Through  multiple  decomposition  of  the  signal  space  and  making 
suitable  approximations,  we  can  perform  RLS  for  non-structured  data  with  O(N)  complexity.  Thus, 
both  the  complexity  problem  in  the  conventional  RLS  and  the  data  constraint  in  the  fast  RLS  can 
be  resolved. 

The  projection  method  has  been  used  to  solve  large  and  sparse  consistent  linear  equations  such  as 
partial  differential  equations  (PDE).  Given  a  linear  equation  Ax  =  b,  where  A  E  TZmxn,  b  E  lZmxl , 
there  are  two  kinds  of  projection  methods  to  solve  it:  For  the  consistent  systems  (m  =  n),  the  linear 
equation  is  decomposed  into  several  smaller  linear  equations  by  “row-partitioning” .  Then  x  can  be 
solved  by  iteration  methods  such  as  Kaczmarz  projection  method  and  Cimmino  projection  method 
[3] [4] [5] .  For  the  inconsistent  systems  (m  >  n),  A  is  decomposed  into  smaller  submatrices  by 
“column-partitioning”.  Then  x  and  the  residual  can  be  solved  by  gradient-based  iteration  method 
[6].  Because  the  whole  data  matrix  is  used  to  compute  the  gradient,  it  is  non-adaptive  in  nature 
and  the  convergence  rate  depends  on  the  property  of  the  A  matrix. 

In  this  paper,  we  will  use  the  concept  of  column-partitioning  to  solve  the  non-structured  RLS 
so  that  the  computational  complexity  can  be  reduced.  The  signal  space  A  is  first  partitioned  into 
two  equal- dimensional  signal  subspaces.  After  performing  RLS  on  each  subspace,  we  try  to  find 
an  approximated  optimal  projection  vector  (of  the  the  whole  signal  space)  from  the  two  optimal 
projection  vectors  of  each  signal  subspace.  Through  the  steps  of  decomposition  and  approximation, 
the  complexity  of  the  RLS  can  be  reduced  by  nearly  half.  If  now  we  repeatedly  apply  the  same 
decomposition  and  approximation  to  each  signal  subspace,  the  RLS  estimation  can  be  solved  with 
0(N )  complexity  by  this  “divide-and-conquer”  approach.  We  shall  call  such  RLS  estimation  the 
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Split  RLS.  The  systolic  implementation  of  the  Split  RLS  based  on  the  QR-decomposition  RLS 
(QRD-RLS)  systolic  array  in  [7]  is  also  proposed.  The  hardware  complexity  for  the  resulting  RLS 
array  can  be  reduced  to  O(N)  and  the  system  latency  is  only  O(log2  N). 

It  is  noteworthy  that  since  approximation  is  made  while  performing  the  Split  RLS,  our  ap¬ 
proach  is  not  to  obtain  exact  least-squares  (LS)  solutions.  The  approximation  errors  will  introduce 
misadjustment  (bias)  to  the  LS  errors.  In  order  to  know  under  what  circumstances  the  algorithm 
will  produce  small  and  acceptable  bias,  we  also  provide  some  basic  analyses  for  the  performance 
of  the  Split  RLS.  The  analyses  together  with  the  simulation  results  indicate  that  the  Split  RLS 
works  well  when  applied  to  broad- band/less-correlated  signals.  Based  on  this  observation,  we  also 
propose  the  orthogonal  preprocessing  scheme  to  improve  the  performance  of  the  Split  RLS.  By 
using  the  transformed  signals,  which  are  less  correlated  than  the  original  ones,  as  the  inputs  of  the 
Split  RLS,  we  can  lower  the  bias  even  if  the  inputs  are  narrow-band/highly-correlated  signals. 

In  the  last  part  of  this  paper,  we  apply  the  Split  RLS  to  the  multidimensional  adaptive  filtering 
(MDAF)  based  on  the  architecture  in  [8].  By  incorporating  the  well-known  McClellan  Transfor¬ 
mation  (MT)  with  the  Split  RLS  systolic  array,  we  can  perform  two-dimensional  (2-D)  adaptive 
filtering  with  only  O(N)  hardware  complexity  and  with  unit  throughput  rate.  Due  to  the  fast  con¬ 
vergence  rate  of  the  Split  RLS,  the  Split  RLS  performs  even  better  than  the  full-size  QRD-RLS  in 
the  application  of  real-time  image  restoration.  This  also  indicates  that  the  Split  RLS  is  preferable 
under  non-stationary  environment. 

The  rest  of  this  paper  is  organized  as  follows.  The  projection  method  and  the  Split  RLS 
algorithm  are  derived  in  Section  2.  The  systolic  implementation  of  the  proposed  algorithm  based  on 
the  QRD-RLS  array  is  then  described  in  Section  3.  The  performance  analysis  and  simulation  results 
are  discussed  in  Section  4.  An  improved  Split  RLS  algorithm  using  the  orthogonal  preprocessing 
scheme  is  considered  in  Section  5.  Finally,  the  application  of  the  Split  R  LS  in  2-D  adaptive  filtering 
is  presented  in  Section  6. 

2  The  Projection  Method 

Given  an  observation  data  matrix  A  =  [al5  a2,  •  •  • ,  a„]  €  7 ZmXn  without  any  exhibited  structure 
and  the  desired  signal  vector  y  €  TZmxl,  the  LS  problem  is  to  find  the  optimal  weight  coefficients 

w  =  [wi,w2,---,tnn]r  (1) 

which  minimize  the  LS  errors 

||ef  =  |!Aw-y||2.  (2) 
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In  general,  w  is  of  the  form  [9]: 


w  -  (AtA)  1ATy. 


(3) 


We  also  have 

y  =  Aw  =  Py,  e  =  y  —  y  (4) 

where  y  is  the  optimal  projection  of  y  on  the  column  space  of  A,  P  =  A(A7  A)~1AT  is  the 
projection  matrix,  and  e  is  the  optimal  residual  vector.  The  principle  of  orthogonality  ensures  that 
e  is  orthogonal  to  the  column  space  of  A.  For  RLS  algorithms  that  calculate  exact  LS  solution,  such 
a  direct  projection  to  the  TV-dimensional  space  takes  0(N2)  complexity.  Knowing  this,  in  order  to 
reduce  the  complexity,  we  shall  try  to  perform  projection  onto  spaces  of  smaller  dimension. 

To  motivate  the  idea,  let  us  consider  the  LS  problem  with  the  partition  A  =  [Ai,  A2],  where 
Ai,  A2  €  7£nx(m/2).  Now  instead  of  projecting  y  directly  onto  the  space  spanned  by  A  (denoted 
as  span{ A}),  we  project  y  onto  the  two  smaller  subspaces,  spanjAi}  and  spcm{A2},  and  obtain 
the  optimal  projections  yi  and  y2  on  each  subspace  (see  Fig.l).  The  next  step  is  to  find  a  “good” 
estimation  of  the  optimal  projection  y,  say  y  approx-  If  we  can  estimate  a  1-D  or  2-D  subspace  from 
yi  and  y2  and  project  the  desired  signal  y  directly  on  it  to  obtain  y approx,  the  projection  spaces 
become  smaller  and  the  computational  complexity  is  reduced  as  well.  There  are  two  basic  criteria 
for  a  good  estimation  of  y.  First,  it  should  be  in  the  column  space  of  A  matrix,  be.,  it  must  be 
a  linear  combination  of  the  column  vectors.  Second,  it  should  be  as  close  to  the  real  projection  y 
as  possible  so  that  the  estimation  error  can  be  reduced.  What  worth  mentioning  is  that  since  the 
problem  itself  is  adaptive  processing  in  nature,  as  long  as  y  approx  will  be  eventually  close  to  y,  the 
initial  distance  between  y  and  y  approx  is  not  an  issue.  In  the  following,  we  propose  two  estimation 
methods  based  on  their  geometric  relationship  in  the  Hilbert  space. 

2.1  Estimation  Method  I  (Split  RLS  I) 

The  first  approach  is  simply  to  add  the  two  subspace  projections  yi  and  y2  together,  i.e., 

y  approx  =  yi  +  y2-  (3) 

This  provides  the  most  intuitive  and  simplest  way  to  estimate  y approx-  We  will  show  later  that  as 
yi  and  y2  are  more  orthogonal  to  each  other,  y approx  will  approach  to  the  optimal  projection  vector 

y- 

Let  Fig. 2(a)  represent  one  of  the  existing  RLS  algorithms  that  project  y  onto  the  Ar-dimensional 
space  of  A  and  compute  the  optimal  projection  e  (or  y,  depending  on  the  requirements)  for  the 
current  iteration.  The  complexity  is  0(N2)  per  time  iteration  for  the  data  matrix  of  size  N.  Now 
using  Fig. 2(a)  as  a  basic  building  block,  we  can  construct  the  block  diagram  for  estimation  method 
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I  as  shown  in  Fig. 2(b).  Because  the  whole  projection  space  is  first  split  into  two  equal  but  smaller 
subspaces  to  perform  the  RLS  estimation,  we  shall  call  this  approach  the  Split-RLS  (SP-RLS).  It 
can  be  easily  shown  that  the  complexity  is  reduced  by  nearly  half  through  such  a  decomposition. 

The  RLS  algorithm  based  on  estimation  method  I  (SP-RLS  I)  can  be  stated  as  foDows,  where 
RLS(A,y,  IV)  denotes  the  RLS  algorithm  in  Fig. 2(a)  and  returns  y(n)  (or  e(n)),  the  last  element 
of  y  (or  e),  for  the  current  iteration. 

Algorithm  1  (SP-RLS  I)  Given  the  input  data  vector  a(n)  =  [a^n),  a2(n)  ,  ■  ■  ■ ,  ajv(n)]T  and 
the  desired  signal  y{n)  at  time  n,  the  SP-RLS  I  computes  the  current  approximated  optimal  residual 
^approx(^)  follows! 


SP-RLS  I 

1.  Update  the  data  matrix  and  the  desired  data  vector  by 


A  (n) 


A(n  -  1) 
a 1  (n) 


>y(n)  = 


y(«  -  i) 

y{n) 


2.  Decompose  A (n)  into  two  equal-dimensional  data  matrices  as  A (n) 
compute  the  current  optimal  projection  of  each  subspace  by 


[Ai(n), A2(t?)].  Then 


yi(n)  =  RLS(Ai(n),y(n),  N/2), 
y2{n)  =  RLS{A2(n),y(n),  N /2). 


3.  Update  the  estimated  optimal  projection  vector: 


yapprox(^)  — 


yapprox(rc  1) 
Vi(n)  +  y2(n ) 


4 ■  Project  the  desired  signal  y (n)  onto  the  1-D  vector  yapprox(n)  to  obtain  eapprox(n): 

^approx  (n)  —  RLS(yappTOx(n),y(n),  1). 


2.2  Estimation  Method  II  (Split  RLS  II) 

In  estimation  method  I,  we  try  to  project  y  onto  the  estimated  optimal  projection  vector  y avvTOx- 
In  this  approach,  we  will  project  y  directly  onto  the  2-D  subspace  A  =  sp«n{yi,y2}.  As  a  result, 
the  estimation  shall  be  more  accurate  with  slightly  increase  in  complexity. 

As  with  estimation  method  I,  we  can  construct  the  block  diagram  for  estimation  method  II  (see 
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Fig. 2(c))  which  is  similar  to  Fig. 2(b)  except  for  the  post-processing  part.  The  projection  residual 
on  span{ yi,y2}  is  computed  through  a  2-input  RLS  block  with  yj  and  y2  as  the  inputs.  The  RLS 
algorithm  based  on  estimation  method  II  (SP-RLS  II)  is  as  follows: 

Algorithm  2  (SP-RLS  II)  Algorithm  SP-RLS  II  is  similar  to  the  SP-RLS  I  except  that  step  3 
and  4  are  modified  as: 

3.  Construct  the  n-by-2  matrix  A (n)  by 

A(n)  = 

where  A(0)  =  O. 

4 ■  Project  the  desired  signal  y (n)  onto  A{n)  to  obtain  eapprox(n): 

^approx  (^  )  =  RiS(A(n),y(n),2). 

2.3  Tree-Split  RLS  based  on  Estimation  Method  I  and  II 

In  estimation  method  I  and  II,  we  try  to  reduce  the  complexity  by  making  one  approximation  at  the 
last  stage.  Now  consider  the  block  diagram  in  Fig. 2(c).  If  we  repeatedly  expand  the  two  building 
blocks  on  the  top  by  applying  the  same  decomposition  and  approximation,  we  will  obtain  the  block 
diagram  in  Fig. 2(d).  We  shall  call  this  new  algorithm  the  Tree-Split  RLS  algorithm  (TSP-RLS)  due 
to  its  resemblance  to  a.  binary  tree.  The  TSP-RLS  algorithm  based  on  Fig.2(d)  is  shown  below. 

Algorithm  3  (TSP-RLS  II)  Given  the  input  data  vector a(n)  =  [aj(n),  <Z2(n)  ,  •  •  -,a/v(n)]T  and 
the  desired  signal  y(n)  at  time  n,  the  TSP-RLS  II  computes  the  current  approximated  optimal  resid¬ 
ual  eapprox(n)  as  follows: 


A (n  -  1) 
)/i(n),t/2(ft) 


TSP-RLS  II 

Initialization:  A(/j(0)  =  O,  for  l  =  0, 1, . . . ,  log2  N,  where  Apj  denotes  the  data  matrix  at  the  Ith 
stage  in  the  TSP-RLS. 


1.  Set  1  =  0,  aj0)(n)  =  a (n),  and  update  y (ra)  as 


y{n)  = 


y(»- !) 

y(n) 
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2.  Update  A (/)(«)  as 


A(0(n)  = 


A (i)(n  -  1) 

a  f,)(n) 


3.  If  N  >  2,  compute  the  approximated  RLS  for  the  current  stage: 

(a)  Decompose  A (/)(n)  into 

A(,)(n)  =  [A1((;)(n),  A2)(/)(n), . . . ,  A^/2  (;)(n)] 

where  A is  a  n-by-2  data  matrix. 

(b)  Compute  yt(n )  via 


Vi(n)  =  RLS(Ait(i)(n),  y(n),  2),  /or  i  =  1,2,. A/2. 

(c/  Form  the  output  vector  of  the  current  stage  as  y(/)(n)  =  [?/i(n),  y2{n), . . . ,  2/Ar/2(70]T- 

(d)  Set  the  input  vector  to  the  next  stage  as  a(/+1)(n)  =  y (/)(n). 

(e)  Set  l  =  l  +  1,  N  =  A/2.  Repeat  step  2-f. 

4 .  Otherwise  (reach  the  final  stage),  apply  the  RLS  to  compute  e^ppTOX(n): 


^approx(^)  —  RLS( A(/)(7l),  y(?l),  2 ), 


cmc/  exit. 

Likewise,  we  can  derive  the  TSP-RLS  algorithm  from  estimation  method  I  (TSP-RLS  I)  by  using 
the  block  diagram  in  Fig. 2(b). 

Lemma  1  The  computational  complexity  of  the  TSP-RLS  algorithm  is  0(A). 

Proof:  For  TSP-RLS  II  only.  Let  the  computational  complexity  per  time  iteration  for  a  A-input 
RLS  be  C/v-  Also  let  C2  =  k  for  a  2-input  RLS,  where  k  is  a  constant.  From  the  block  diagram 
in  Fig. 2(c),  we  know  that  the  evaluation  of  the  A-input  RLS  is  decomposed  into  the  evaluation  of 
two  A/2-input  RLS’s  plus  one  2-input  RLS.  Hence, 

Cat  =  2Cjv/2  +  k.  (6) 
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Since  the  TSP-RLS  II  is  obtained  by  recursively  expanding  the  block  diagram  of  the  SP-RLS  II, 
the  C;v  of  TSP-RLS  II  can  be  computed  by  recursively  expanding  Cjy  in  (6): 

/-i 

C/v  —  2(2C;v/4  T  k)  +  k  =  . . .  =  o} C jy j2i  +  k  2”  =  (2^"*_1  —  1  )k,  (7) 

n=0 

where  /  can  be  computed  by  setting  CN/2i  =  C2,  i-e.,  I  =  log2  TV  —  1.  Thus, 


C.v  =  (TV  -  1  )k 


(8) 


which  is  on  the  order  of  TV.  □ 

3  Systolic  Implementation 

In  this  section,  we  will  present  the  systolic  implementation  of  the  above  algorithms.  First  of  all, 
we  should  note  that  each  RLS  building  block  in  Fig.2  is  independent  of  choices  of  RLS  algorithms. 
Because  the  QRD-RLS  array  in  [7]  can  compute  the  RLS  estimation  in  a  fully- pipelined  way,  it  is 
a  good  candidate  for  our  purpose.  However,  the  original  array  computes  only  the  optimal  residual. 
In  order  to  obtain  the  two  optimal  subspace  projections  yq  and  y2,  a  delayed  version  of  y(n)  (the 
desired  signal  at  time  n)  should  be  kept  in  the  rightmost  column  of  the  QRD-RLS  array.  Once  the 
residual  is  computed,  we  can  use 

fa(n)  =  y(n)  -  ci(n), 

Mn)  =  y(n )  -  h{n) 

to  obtain  the  two  subspace  projections.  Also,  the  delayed  y(n )  can  be  sent  to  the  next  stage  as 
input  so  that  no  global  communication  is  required.  Fig. 3  shows  the  modified  QRD-RLS  systolic 
array  and  the  detailed  operations  of  its  processing  elements  (PE’s).  In  the  following  discussions, 
we  shall  call  the  modified  QRD-RLS  array  the  projection  array ,  and  the  QRD-RLS  array  in  [7]  the 
residual  array ,  respectively. 

Now  based  on  the  block  diagram  in  Fig.2,  we  can  implement  the  Split  RLS  algorithms  in  the 
following  way:  For  those  RLS  blocks  which  need  to  compute  the  optimal  projection,  the  projection 
array  is  used  for  their  implementations,  while  for  those  RLS  blocks  which  need  to  compute  the 
optimal  residual  (usually  in  the  last  stage),  the  residual  array  is  used.  The  resulting  systolic 
implementations  of  the  SP-RLS  I, II  and  the  TSP-RLS  II  are  demonstrated  in  Fig. 4(a), (b)  and  (c). 

Lemma  2  The  two  TSP-RLS  systolic  arrays  (TSP-RLS  I, II)  consist  of  0(N )  angle  computers 
and  rotators,  and  the  total  system  delay  is  O(log2  N). 

Proof:  Suppose  a  TV-input  TSP-RLS  II  array  requires  An  angle  computers  and  Rn  rotators.  From 
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(6)-(8),  we  have 


An  =  2AN/2  +  2  =  (2l+1  -  1)2  =  2 (N  -  1), 

Rn  =  2 Rn/2  +  3  =  (2/+1  -  1)3  =  3(A  -  1). 

On  the  other  hand,  let  the  total  system  delay  for  a  A-input  SP-RLS  II  array  be  T/v 
we  have 

Rn  =  Tn/2  +  3. 

Then  T/v  for  the  TSP-RLS  II  can  be  obtained  by  expanding  Tn  in  (11): 

7/v  =  (Tn/22  +  3)  +  3  =  . . .  =  3  •  (/  +  1).  (12) 

Recall  from  Lemma  1  that  l  =  log2  N  -  1.  Thus,  T/v  =  3  •  log2  N .  Similarly,  it  can  be  shown  that 
An  =  Rn  =  2 N  -  1  and  Tn  =  2  •  (log2  Ar  +  1)  for  the  TSP-RLS  I  array.  □ 

A  comparison  of  hardware  cost  for  the  full-size  QRD-RLS  in  [7]  (denoted  as  FULL-RLS),  SP- 
RLS,  TSP-RLS,  and  QRD-LSL  [2,  chap. 18],  is  listed  in  Table  1.  As  we  can  see,  the  complexity  of 
the  TSP-RLS  is  comparable  with  the  QRD-LSL  which  requires  shift  data  structure. 

4  Performance  Analysis  and  Simulation  Results 

It  is  noteworthy  that  our  approach  is  not  an  exact  LS  solution  since  the  constructed  y approx  is 
just  an  approximation  of  the  optimal  projection  vector.  This  approximation  error  will  introduce 
misadjustment  (bias)  to  the  LS  estimation.  In  the  sequel,  we  will  try  to  analyze  the  bias  for  SP- 
RLS  I  and  SP-RLS  II  by  investigating  the  relationship  between  the  optimal  projection  of  the  whole 
space,  y,  and  the  optimal  projections  of  the  two  equal-dimensional  subspaces,  yi  and  y2.  Due  to 
the  multiple  RLS  approximations  in  the  TSP-RLS  algorithm,  it  is  almost  impossible  to  provide 
an  exact  close-form  solution  to  the  final  output  of  the  TSP-RLS.  Nevertheless,  the  analysis  of  the 
SP-RLS  algorithms  can  give  us  an  idea  that  under  what  conditions  will  the  algorithms  produce 
small  and  acceptable  misadjustment. 

4.1  Estimation  Error  for  SP-RLS  I 

Consider  the  LS  problem  in  (2)  and  decompose  the  column  space  of  A  into  two  equal-dimensional 
subspaces,  be.,  A  =  [Ai  ,  A2].  Let  wT  =  [wf  ,w^],  then  the  optimal  projection  vector  y  can  be 
represented  as 

y  —  Aw  =  yi  +  y2  (13) 


(10) 

.  From  Fig.4(b), 
(11) 
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where  yi  =  AiWi  and  y2  =  A2W2.  From  the  normal  equations 

ArAw  =  Ary,  (14) 

we  have 

AfAiWi  +  AJA2w2  =  Afy,  (15) 

A^AiWj  -I-  A2W2  =  A|y.  (16) 

Let  Wi,yt,  i  —  1,2,  be  the  optimal  weight  vectors  and  the  optimal  projection  vectors  when 
considering  two  subspaces  sponjAi}  and  span{ A2}  separately.  From  (13)  and  (14),  they  are  given 

by 

w,'  =  (AfAi)_1Afy,  yi  =  A,wt-,  1=1,2.  (17) 

Premultiplying  Ai(AfAi)-1  on  (15)  and  using  (17),  we  have 

AiWj  +  Ai(Af  A!)_1Af  A2w2  =  A^j.  (18) 

Similarly,  from  (16)  and  (17)  we  can  obtain 

A2(A;fA2r1A2  AiWj  +  A2w2  =  A2w2.  (19) 

By  the  definitions  of  yi,y2,yi,y2,  (18)  and  (19)  can  be  written  as 

yi  +  Piy2  =  yi, 

P2yi  +  y2  =  y2 

where  P8,  i  =  1,2  are  the  projection  operators  defined  in  Section  2. 

In  SP-RLS  I,  we  estimate  the  optimal  projection  by 

y approx  —  yi  4”  y2,  (22) 

and  the  estimation  error  (bias)  is  given  by 

||Aei||  =  W&approx  —  ®||  =  lly  —  y approxll  •  (23) 

Substituting  (20)-(22)  into  (23)  yields 

|| Aej||2  =  ||y  -  yr  -  y2||2  -  ||Piy2  +  P2yi||2-  (24) 


(20) 

(2D 
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In  order  to  lower  the  bias  value,  Piy2  and  P2yi  should  be  as  small  as  possible.  Note  that 

Piy2  =  Ai(AfAi)_1Af  A2w2  =Ai^f11$12w2,  (25) 

P2yi  =  A2(A|’A2)“1A|’A1w1  =  A2$J21^21w1  (26) 

where  4>tJ  =  Aj  Aj  is  the  deterministic  correlation  matrix.  When  the  column  vectors  of  Aj  and 
A 2  are  more  orthogonal  to  each  other,  $12  and  4>2i  will  approach  to  zero  and  the  bias  is  reduced 
accordingly. 

4.2  Estimation  Error  for  SP-RLS  II 

Consider  the  block  diagram  of  the  SP-RLS  II  in  Fig. 2(c).  The  optimal  projection  of  y  onto  the 
space  span{yi,y2}  can  be  written  as 


y approx  —  fciyi  P  hy2 


(27) 


where  k  =  [fci,A:2]T  is  the  optimal  weight  vector.  From  the  normal  equations ,  we  have 

[yny2]r[yi,y2]k  =  [yi>y2]ry-  (28) 


Using  the  facts  that 


we  can  simplify  (28)  as  follows: 


yi  =  y  -  ei,  y2  =  y  -  e2, 
yfy  =  l|yill2,  y2ry  =  lly2|l2> 


*1  +*2(1  _  ~  1 

T  A. 


^l(1“  ^F)  +  ^’2  -  1 

Then  the  optimal  weight  vector  can  be  solved  as 


k  =  [ku  k}T  = 


~T~  ~T~ 

llyiil2’  Ilyzll2 


where 


a  —  1  — 


yfy2  yjyi 


-i 


(29) 


(30) 


(31) 


(32) 


v  llyiil2  l|y2|l2/ 

Note  that  yfy2  =  ||yi||||y2||  cos<?,  where  0  denotes  the  angle  between  these  two  vectors,  we  can 
rewrite  a  as 

a  =  (1  -  cos2  0)-1  =  esc2  0.  (33) 
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From  Fig.l,  we  have 


ll^approrll  —  ||y|!  1 1  y  approx  1 1 

=  ||y|!2  -  yTyaVVrox  (34) 

=  llyli2  —  ^illyill2  -  ^2||y2||2- 

Substituting  (31)  into  (34)  yields 

lleapproxll2  =  ||y ||2  -  csc20(yfe2  +  yfei) 

=  l|y||2  -  esc2  #[yf(y  -  y2)  +  y2T(y  -  yi)]  (35) 

=  lly |j2  -  esc2  <9||y!  -  y2||2. 

Thus,  the  bias  of  SP-R.LS  II  is  given  by 

||Ae2||2  =  \\eappr0T\\2  -  ||e||2 

=  l|y||2  -  esc2  #||yi  -  y2||2  -  (||y||2  -  ||y||2)  (36) 

=  l|y||2  -  csc26>||y!  -  y2||2. 

For  any  given  8,  it  can  be  shown  that  (see  Appendix)  ||Ae2||2  is  bounded  by 

||Ae2||2<||Ae1||2.  (37) 

This  implies  that  the  performance  of  SP-RLS  II  is  better  than  that  of  SP-RLS  I  in  terms  of 
estimation  error. 

4.3  Bandwidth,  Eigenvalue  Spread,  and  Bias 

From  (24)  and  (36)  we  know  that  the  orthogonality  between  the  two  subspaces  spanjAi}  and 
span{  A2}  will  significantly  affect  the  bias  value;  i.e.,  signals  with  different  degrees  of  orthogonality 
will  have  different  bias  values  for  the  Split  RLS  algorithm.  However,  in  practice,  the  evaluation 
of  degree  of  orthogonality  for  multidimensional  spaces  is  nontrivial  and  computationally  intensive 
(e.g.,  CS-decomposition  [10,  pp.  75-78]).  Without  loss  of  generality,  we  will  only  focus  our  discus¬ 
sion  on  single-channel  case,  where  the  data  matrix  A  consists  of  only  shifted  data  and  the  degree  of 
orthogonality  can  be  easily  measured.  In  such  a  case,  the  degree  of  orthogonality  can  be  measured 
through  two  indices:  the  bandwidth  and  the  eigenvalue  spread  of  the  data.  If  the  signal  is  less  cor¬ 
related  (orthogonal),  the  autocorrelation  function  has  smaller  duration  and  thus  larger  bandwidth. 
Noise  processes  are  examples.  On  the  other  hand,  narrow-band  processes  such  as  sinusoidal  signals 
are  highly  correlated.  If  the  data  matrix  is  completely  orthogonal,  all  the  eigenvalues  are  the  same 
and  the  condition  number  is  one.  This  implies  that  if  the  data  matrix  is  more  orthogonal,  it  will 
have  less  eigenvalue  spread.  It  is  clear  from  our  previous  discussion  that  the  SP-RLS  will  render 
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less  bias  for  the  broad-band  signals  than  for  the  narrow-band  signals. 

As  to  the  TSP-RLS,  note  that  the  output  optimal  projection  is  a  linear  combination  of  the 
input  column  vectors.  If  the  inputs  to  one  stage  of  the  TSP-RLS  array  are  less  correlated,  the 
outputs  of  this  stage  will  still  be  less  correlated.  As  an  example,  suppose  now  the  inputs  of  the 
TSP-RLS  II  array  are  completely  orthogonal,  we  have 

y,  =  +  w2la2i,  for  i  =  1,2,..., A/2  (38) 

where  u>2i_i  and  w2l  are  the  optimal  weight  coefficients  in  each  subarray.  It  can  be  easily  seen  that 
yjy3  =  0,  for  i  ^  j.  The  orthogonality  of  the  original  inputs  is  still  preserved  at  the  next  stage. 
Therefore,  the  signal  property  at  the  first  stage  such  as  bandwidth  plays  an  important  role  in  the 
overall  performance  of  the  TSP-RLS. 

4.4  Simulation  Results 

In  the  following  simulations,  we  will  use  the  autoregressive  (AR)  process  of  order  p  (AR(p))  to 
generate  the  simulation  data 

p 

u(n )  =  y;  Wj  u(n  -  i)  +  v(n)  (39) 

2  =  1 

where  v(n)  is  a  zero-mean  white  Gaussian  noise  with  power  equal  to  0.1.  Besides,  the  pole  locations 
of  the  AR  processes  are  used  to  control  the  bandwidth  property:  As  the  poles  are  approaching  the 
unit  circle,  we  will  have  narrow-band  signals;  otherwise,  we  will  obtain  broad-band  signals.  All  the 
simulation  results  are  based  on  the  average  of  100  independent  trials. 

In  the  first  experiment,  we  try  to  perform  fourth-order  linear  prediction  (LP)  with  the  AR(4) 
processes  using  the  SP-RLS  and  TSP-RLS  systolic  arrays  described  in  Section  3.  In  this  case,  the 
SP-RLS  II  is  equivalent  to  the  TSP-RLS  II  because  they  have  identical  implementations.  Table  2 
shows  the  AR(4)  models  used  in  this  experiment.  In  model  I  and  II,  the  two  poles  are  at  the  same 
radii  varied  from  0.5  to  0.95.  In  model  III  and  IV,  one  pole  is  fixed  and  the  other  is  variable.  For 
each  model,  the  LP  problem  is  repeated  for  ten  times  by  varying  the  poles  location  from  0.5  to  0.95 
with  0-05  increment.  The  simulation  results  are  shown  in  Fig. 5,  in  which  the  :r-axis  represents  the 
location  of  the  variable  poles  in  model  I-IV,  and  y- axis  represents  the  average  output  noise  power 
after  convergence.  Ideally  the  output  should  be  the  noise  process  v(n)  with  power  equal  to  0.1.  As 
we  can  see,  when  the  bandwidth  of  input  signal  becomes  wider,  the  bias  is  reduced.  This  agrees 
perfectly  with  what  we  expected. 

Beside  the  bias  values,  we  also  plot  the  square  root  of  the  spectral  dynamic  range  D  (the  ratio 
of  the  maximum  to  the  minimum  amplitude  on  the  AR  power  spectrum)  associated  with  each  AR 
model.  It  is  known  that  the  eigenvalue  spread  of  the  data  signal  is  bounded  by  the  spectral  dynamic 


12 


range  [11] 


1  <  ^£2£[  <  max{|£7(e^)|2}  a 
Amin  “  min{|t/(e^)|2} 


(40) 


where  U(eJLJ)  is  the  spectrum  of  u(n).  From  the  simulation  results,  we  see  the  consistency  between 
the  bias  value  and  the  spectral  dynamic  range.  This  indicates  that  the  performance  of  the  Split 
RLS  algorithms  is  also  affected  by  the  eigenvalue  spread  of  the  input  signal.  This  phenomenon  is 
similar  to  what  we  have  seen  in  the  LMS-type  algorithms. 

In  the  second  experiment,  we  extend  the  previous  experiment  to  perform  eighth-order  LP  for 
four  AR(8)  processes.  The  setting  for  the  poles  is  listed  in  Table  3.  The  simulation  results,  shown 
in  Fig. 6,  again  validate  the  bandwidth-bias  relationship.  Beside  the  bias  effect,  two  observations 
can  be  made  from  these  two  experimental  results: 


1.  The  SP-RLS  performs  better  than  the  TSP-RLS.  This  is  pretty  much  due  to  the  number  of 
approximation  stages  in  each  algorithm. 

2.  The  overall  performance  of  SP-RLS  II  is  better  than  that  of  SP-RLS  I.  This  agrees  with  our 
analysis  in  (37). 

Next  we  want  to  examine  the  convergence  rate  of  our  algorithm.  An  AR(7)  model  is  used  to 
generate  data  and  the  sum  of  the  current  inputs  is  used  as  the  desired  signal.  The  output  should 
be  zero  after  it  converges.  Fig. 7  shows  the  convergence  curve  for  the  8-input  FULL-RLS  and  the 
TSP-RLS  II  after  some  initial  perturbation.  It  is  interesting  to  note  that  although  the  TSP-RLS  II 
has  some  bias  after  it  converges,  its  convergence  rate  is  faster  than  that  of  the  FULL-RLS.  This  is 
due  to  the  fact  that  the  O(log2  N )  system  latency  of  the  TSP-RLS  is  less  than  the  O(N)  latency  of 
the  FULL-RLS.  Also,  to  initialize  an  8-input  full-size  array  takes  more  time  than  to  initialize  the 
three  small  cascaded  2-input  arrays.  The  property  of  faster  convergence  rate  is  especially  preferred 
for  the  tracking  of  parameters  in  non-stationary  environments.  In  Section  6  we  will  provide  an 
image  restoration  simulation  to  verify  this  observation. 


5  Projection  Method  with  Orthogonal  Preprocessing 

In  the  previous  sections,  we  have  seen  that  the  Split  RLS  performs  very  well  when  the  input  signal 
is  less  correlated  (or  broad-band).  However,  in  many  applications,  processing  of  highly-correlated 
(or  narrow-band)  signals  is  inevitable.  We  are  thus  motivated  to  investigate  a  way  to  improve  the 
Split  RLS  algorithm  when  dealing  with  highly-correlated  signals.  From  the  analyses  in  Section 
4,  we  know  that  the  estimated  optimal  projection  will  approach  to  the  real  optimal  projection 
when  all  subspaces  are  more  orthogonal  to  each  other.  Therefore,  if  we  can  preprocess  the  data 
matrix  such  that  the  column  spaces  become  more  orthogonal  (less  correlated)  to  each  other,  a 
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better  performance  is  expected.  Such  a  concept  has  been  employed  in  the  “Transform  domain  LMS 
algorithm”  (TDLMS)  [  1 2] [  1 3] [  1 4] ,  as  well  as  in  the  row-partitioning  projection  methods  [3] [4].  It 
is  clear  that  Gram-Schmidt  orthogonalization  will  render  an  excellent  performance.  However,  the 
0(N 2)  complexity  prevents  us  from  considering  it. 

5.1  Transform-Domain  LS  Problem 

In  transform-domain  signal  processing,  the  input  data  matrix  A  is  first  transformed  into  another 
data  matrix  Z 

Z  =  A  T  (41) 

where  T  i's  an  unitary  transformation  matrix  of  rank  N .  The  transform-domain  LS  problem  is  to 
find  the  optimal  weight  vector  k  =  [k\,k2,  -  •  •  ,k^j]T  which  minimizes  the  LS  error  ||Zk-y||2  in 
the  transform  domain.  Because  Z  and  A  span  the  same  signal  space,  the  LS  error  will  be  the  same 
as  in  (2).  The  transformation  process  can  be  viewed  as  a  set  of  filter  banks  with  equally  spaced 
mainlobes  [14].  Each  column  vector  of  Z  corresponds  to  the  output  signal  of  a  given  filter  in  the 
filter  banks.  Therefore,  the  column  vectors  of  Z  will  be  less  correlated  than  those  of  A.  This  helps 
us  to  obtain  a  better  y approx  according  to  our  observations  in  (24)  and  (36). 

The  operation  for  the  Split  RLS  with  orthogonal  preprocessing  is  as  follows:  First  perform  the 
orthogonal  transform  on  the  current  data  vector,  then  use  the  transformed  data  as  the  inputs  of 
the  Split  RLS.  In  our  approach,  the  Discrete  Cosine  Transform  (DCT)  and  the  Discrete  Hartley 
Transform  (DHT)  are  used  as  the  preprocessing  kernels.  As  to  the  hardware  implementation,  we 
can  employ  the  time-recursive  DCT/DHT  lattice  structure  in  [15]  to  continuously  generate  the 
transformed  data.  Fig. 8  shows  the  SP-RLS  I  array  with  DCT/DHT  preprocessing.  The  transform- 
domain  data  are  first  generated  through  the  DCT/DHT  lattice  structure,  then  are  sent  to  the 
SP-RLS  I  array  to  perform  the  RLS  filtering.  The  TSP-RLS  array  with  the  preprocessing  scheme 
can  be  constructed  in  a  similar  way.  Since  both  the  DCT/DHT  lattice  structure  and  the  TSP-RLS 
array  require  O(N)  hardware  complexity,  the  total  cost  for  the  whole  system  is  still  0(N). 

In  addition  to  the  two  aforementioned  transforms,  for  the  purpose  of  further  decorrelation,  we 
also  propose  a  new  preprocessing  scheme  called  the  Swapped  DCT  (SWAP-DCT)  based  on  the 
DCT.  Suppose  Z  =  [zi,z2,...,z jv]  is  the  DCT-domain  data.  In  the  DCT  preprocessing  given  in 
Fig. 8,  the  input  data  is  partitioned  as 


Ai  =  [zi,Z2,  •  •  •  ,Zjv/2], 

A-2  =  [zjV/2+l>  zN/2+2i  •••■>  zn}- 


(42) 
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To  make  the  input  data  more  uncorrelated,  we  permute  the  transformed  data  column  as 

Ai  =  [zj ,  Z3,  •  • • , Z2fc— 1 ,  •  •  • ,  ZjV— 1] ,  . 

(43) 

A  2  =  [z2,  Z4,  .  .  . ,  Z 2fc,  . . . ,  Zjv] 

in  the  SWAP-DCT  preprocessing  scheme.  Fig. 9  shows  the  spectrum  of  the  normal  DCT  partitioning 
and  the  SWAP-DCT  partitioning.  Recall  that  the  eigenvalue  spread  will  affect  the  bias  value,  and 
the  eigenvalue  spread  is  bounded  by  the  spectral  dynamic  range.  It  is  obvious  that  the  SWAP-DCT 
preprocessing  scheme  will  have  better  performance  due  to  the  smaller  eigenvalue  spread  in  both 
Ai  and  A2. 

5.2  Simulation  of  the  TSP-RLS  with  Orthogonal  Preprocessing 

To  validate  our  arguments  for  the  orthogonal  preprocessing,  we  will  repeat  the  two  experiments  in 
Section  4.4  for  the  TSP-RLS  II  with  three  different  preprocessing  schemes  (DCT,  DHT,  SWAP- 
DCT).  The  simulation  results  are  given  in  Fig. 10  and  Fig. 11.  In  general,  the  TSP-RLS  with  DCT 
preprocessing  gives  a  fairly  significant  improvement  in  the  bias  value  over  the  TSP-RLS  without 
any  preprocessing  (Normal  TSP-RLS).  Nevertheless,  some  exceptions  can  be  found  in  AR(4).III 
and  AR(8).III.  As  to  the  DHT,  it  does  not  perform  well  in  most  cases  except  in  AR(8).II  and 
AR(8).IV.  It  is  as  expected  that  the  SWAP-DCT  performs  better  than  the  DCT  in  most  cases. 
This  supports  our  assertion  for  the  effect  of  the  SWAP-DCT. 

From  the  simulation  results,  we  can  see  that  it  is  almost  impossible  to  find  one  transform  that 
is  optimal  for  all  signals.  This  is  also  true  for  the  TDLMS  algorithms  [13].  In  general,  the  DCT 
and  the  SWAP-DCT  are  good  choices  to  improve  the  performance. 

6  Application  to  Multidimensional  Adaptive  Filtering 

In  this  section,  we  will  apply  the  Split  RLS  to  the  multidimensional  adaptive  filtering  (MDAF) 
based  on  the  architecture  in  [8].  In  [8],  the  McClellan  Transformation  (MT)  [16]  was  employed  to 
reduce  the  total  parameters  in  the  2-D  filter  design,  and  the  QRD-RLS  array  in  [17]  was  used  as 
the  processing  kernel  to  update  the  weight  coefficients.  In  our  approach,  we  replace  the  QRD-RLS 
array  with  the  Split  RLS  array.  This  will  result  in  a  more  cost-effective  MDAF  architecture  while 
with  even  better  performance. 
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6.1  2-D  Adaptive  Filtering  using  McClellan  Transformation 

Given  a  1-D  zero-phase  FIR  filter  with  support  —N  <  i  <  N ,  the  frequency  response  can  be  written 
as 

N  N 

H(u>)  =  ^  h{  cos (icj)  —  ^2  /iiTjfcosu;].  (44) 

2=0  2=0 

where  T,-[*]  denotes  the  Chebyshev  polynomial  of  degree  i.  Using  the  transformation  of  variables 
[16] 

F(u>  \,u2)  — *  cos  a;,  (45) 

we  obtain  the  MT  2-D  frequency  response 

N 

H(u1,u>2)  =  '£hiTi[F(u1,u2)\.  (46) 

t=0 

The  MT  is  a  near-optimal  design  method  for  2-D  filters  [18,  chap.4].  It  decomposes  the  design 
problem  into  the  design  of  the  1-D  prototype  FIR  filter ,  hi,  i  =  0, 1,  •  •  •,  N,  and  the  2-D  transfor¬ 
mation  function,  F(u> i,u2).  The  former  defines  the  frequency  response  along  the  2-D  frequency 
plane,  while  the  latter,  which  is  usually  a  small  fixed  2-D  zero-phase  FIR  filter,  maps  the  1-D  fre¬ 
quencies  into  contours  in  the  2-D  frequency  plane.  Fig. 12  shows  the  block  diagram  which  performs 
2-D  filtering  based  on  the  MT  and  the  Chebyshev  recursion  [19] [8] .  Each  PE  is  a  linear  systolic 
array  realizing  the  2-D  transformation  function  in  (45)  with  xfini,  n2),  i  =  0,1,-- -,1V,  as  the  PE 
output.  y{n-[,  n2)  is  the  desired  2-D  signal,  and  h  =  [hi,h2,  •  ■■,hiy]T  is  the  tap  coefficient  vector  of 
the  1-D  prototype  filter.  In  [8],  h  is  updated  by  considering  Fig. 12  as  a  multichannel  LS  problem, 
i.e.,  h  is  obtained  by  minimizing  the  LS  error 

TV 

lk(rza,  tz2)||2  =  \\y{n\,n2)  -  y{ni,n2)\\2  =  \\y{n1,n2)  -  J2  M.(«i.  ?i2)||2  (47) 

i=0 

for  each  incoming  data.  For  the  systolic  implementation,  h  is  solved  through  the  QRD-RLS  array 
in  [17]  with  2q(rci,  n2)’s  and  y{n\,n2 )  as  the  array  inputs.  However,  the  opposite  data  wavefront 
in  the  QRD-RLS  array  as  well  as  the  0(N2)  hardware  complexity  makes  the  system  inappropriate 
for  cost-effective  pipelined  processing. 

In  some  applications,  such  as  image  restoration  and  image  registration,  the  estimation  error 
e(n\,n2)  is  the  only  parameter  of  interest.  In  such  a  case,  we  can  modify  the  MDAF  structure 
in  [8]  by  replacing  the  QRD-RLS  array  with  the  FULL-RLS  array  since  the  latter  produces  the 
LS  error  in  a  fully-pipelined  way.  To  further  reduce  the  hardware  complexity,  we  can  employ  the 
TSP-RLS  array  as  the  processing  kernel.  As  a  result,  we  can  perform  2-D  adaptive  filtering  with 
O(N)  hardware  complexity  and  with  unit  throughput  rate. 
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6.2  Simulation  with  TDALE 


The  performance  of  the  proposed  MDAF  architecture  is  examined  by  applying  it  to  a  two-dimensional 
adaptive  line  enhancer  (TDALE)  [20] [21]  for  image  restoration.  The  block  diagram  is  depicted  in 
Fig. 13.  The  primary  input  is  the  well-known  ’’LENA”  image  degraded  by  a  white  Gaussian  noise. 
A  2-D  unit  delay  z^1  z^1  is  used  as  a  decorrelation  operator  to  obtain  the  reference  image.  The 
image  signal  is  fed  into  the  system  in  the  raster  scanned  format  -  from  left  to  right,  top  to  bot¬ 
tom.  After  the  input  image  goes  through  the  TSP-RLS  array,  the  generated  estimation  error  is 
subtracted  from  the  reference  signal  to  get  the  filtered  image.  For  comparison,  we  also  repeat  this 
experiment  using  the  FULL-RLS  array. 

The  simulation  results  are  shown  in  Table  4  and  in  Fig.  14.  We  can  see  that  the  performance  of 
the  TSP-RLS  is  better  than  the  2-D  joint  process  lattice  structure  in  [21]  when  the  signal-to-noise 
ratio  (SNR)  is  low.  It  is  also  interesting  to  note  that  the  TSP-RLS  outperforms  the  FULL-RLS. 
As  we  discussed  in  Section  4.4,  although  the  TSP-RLS  has  misadjustment  after  convergence,  it 
converges  faster  than  the  FULL-RLS.  This  fast-tracking  property  is  preferable  under  non-stationary 
environments  where  convergence  is  very  unlikely. 

7  Conclusions 

In  this  paper,  we  introduced  a  new  O(N)  fast  algorithm  and  architecture  for  the  RLS  estimation 
of  nonstructured  data.  Compared  with  the  conventional  RLS,  this  new  approach  is  sub-optimal 
in  the  sense  that  it  introduces  extra  bias  to  the  LS  estimations.  Nevertheless,  we  have  shown 
that  the  bandwidth  and/or  the  eigenvalue  spread  of  the  input  signal  can  be  used  as  a  good  per¬ 
formance  index  for  these  algorithms.  Therefore,  the  users  will  have  small  bias  when  dealing  with 
broad-band/less-correlated  signals.  For  narrow-band  signals,  we  can  also  employ  the  orthogonal 
preprocessing  to  improve  its  performance.  The  low  complexity  as  well  as  the  fast  convergence 
rate  of  the  proposed  algorithm  makes  it  suitable  for  RLS  estimation  under  the  non-stationary  or 
fast-changing  environments  where  the  data  matrix  has  no  structure.  For  example,  one  possible 
application  of  the  Split  RLS  is  in  the  Sidelobe  Cancellor  (SLC),  in  which  the  inputs  of  the  auxiliary 
arrays  are  mainly  noises.  The  fast  tracking  capability  of  the  Split  RLS  algorithm,  as  demonstrated 
in  the  image  restoration  simulations,  provides  a  very  promising  potential  for  parameter  tracking 
under  non-stationary  environments.  Furthermore,  the  systolic  architecture  of  the  Split  RLS  is  fully 
parallel  and  pipelined  and  thus  provides  a  high-throughput  implementation  for  real-time  applica¬ 
tions. 
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Appendix 


In  this  appendix,  we  will  show  that  the  bias  of  SP-RLS  II  is  bounded  by  that  of  SP-RLS  I.  From 
(24)  and  (36),  we  have 

llAeill2  =  l|y-yi  -y2||2, 

llAe2||2  =  llyll2  -  esc2  0||yi  -  y2||2. 

Note  that  esc2  9  >  1  for  any  6.  Thus, 

llAei||2  -  llAe2||2  >  ||y  -  yi  -  y2||2  -  ||y||2  +  ||yi  -  y2||2 

=  2[(||yi|[2  +  ||y2||2)  -  yr(yj +y2)].  (49) 

From  (20)  and  (21),  we  have 

llyill2  =  llyill2  +  2yfy2  +  y2’(Piy2),  (50) 

I|y2||2  =  ||y2||2  +  2yfy2  +  yf(P2yi)  (51) 

where  the  fact  that  y?  (Piyg)  =  y2  (P2yi)  =  yfy2  is  used.  Combining  (50)  and  (51)  yields 

llyill2  +  ||y2||2  =  ||y||2  +  2yfy2  +  y^(Piy2)  +  yf(P2y1).  (52) 


On  the  other  hand, 

yT(yi  +  y2)  =  yT[(yi  +  Pifo)  +  (h  +  P2yi)]  =  llyll2  +  y1  (Pifo)  +  yT(P2yi  )•  (53) 

Substituting  (52)  and  (53)  into  (49),  we  have 

llAei||2-|lAe2||2  >  2[2yfy2-yf(Piy2)-y^(P2yi)]  =  2[(yJ’(y2-P1y2)  +  y^(yi-P2yi)].  (54) 

Becctuse  Pi  ctnd.  P2  projection  mettriceS)  it  is  eleeir  thcit  y2  ^  f*iy2?  yi  ^  P2yi-  Therefore. 
||Aei||2  -  ||Ae2||2  >  0,  t.c„  || Ae2 1 j2  <  ||Aea||2.  □ 
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No.  of 

Angle  Computers 

No.  of 
Rotators 

System 

latency 

FULL-RLS 

A 

A(A  +  l)/2 

A  +  1 

SP-RLS  I 

A  +  1 

A2/4  +  N/2  +  1 

A/2  +  3 

SP-RLS  II 

A  +  2 

A2/ 4  +  N/2  +  3 

A/2  +  4 

TSP-RLS  I 

2  A  -  1 

2  A  -  1 

2(log2  A  +  1) 

TSP-RLS  II 

2(N  —  1) 

3 (A  -  1) 

3  log2  A 

QRD-LSL 

2  A  +  1 

3A  +  1 

A  +  1 

Table  1:  Comparison  of  hardware  cost  for  the  FULL-RLS,  SP-RLS,  TSP-RLS,  and  QRD-LSL, 
where  the  QRD-LSL  requires  shift  data  structure. 


AR(4) 

Pi 

P2 

<h 

<t>2 

I 

0.5  -  0.95 

Pi 

00 

rH 

00 

II 

0.5  -  0.95 

Pi 

5/8tt 

OO 

III 

0.5  -  0.95 

0.6 

1/8tt 

OO 

IV 

0.6 

0.5  -  0.95 

5/87T 

OO 

r— 

Table  2:  List  of  the  AR(4)  models  used  in  Experiment  1  (with  poles  at  pie^3^1  and  p2e±3'i>'2) 


AR(8) 

pi 

P2 

P3 

P4 

<t>  1 

<f>  2 

<t>  3 

(p4 

I 

WEBEM 

Pl 

Pi 

1  / 1 5  7T 

4/157T 

8/157T 

12/15t r 

II 

0.5  -  0.95 

Pi 

Pi 

Pi 

2/15tT 

5/157T 

OO 

>— k 

cn 

1 1  / 1 5  7T 

III 

0.5  -  0.95 

Pi 

0.6 

1/157 r 

4/157T 

OO 

t— k 

12/157T 

msm 

0.6 

0.5  -  0.95 

0.6 

P2 

2/15? r 

5/157T 

8/15tt 

1 1  / 1 5  7T 

Table  3:  List  of  the  AR(8)  models  used  in  Experiment  2  (with  poles  at  pie±3<3>1,  p2e±3<t>2,  p3e±3^3, 
pAe±3^). 


Input  SNR  (dB) 

10.0 

3.0 

0.0 

Output  SNR  in  [21] 

12.0 

8.0 

6.0 

Output  SNR  using  FULL-RLS 

10.5 

9.0 

7.6 

Output  SNR  using  TSP-RLS  II 

10.9 

9.8 

8.7 

Table  4:  SNR  results  of  the  TDALE  in  the  application  of  restoring  noisy  image. 
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(C) 


(d) 


Figure  2:  Block  diagram  for  (a)  a  iV-input  RLS  algorithm,  (b)  the  SP-RLS  I  algorithm,  (c)  the 
SP-RLS  II  algorithm,  (d)  the  TSP-RLS  II  algorithm. 
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y  y 


Figure  3:  Modified  QRD-RLS  array  (Projection  array)  and  its  PE  operations. 
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•  ®  no 


(a) 


Figure  4:  Systolic  implementation  of  (a)  the  SP-RLS  I,  (b)  the  SP-RLS  II,  (c)  the  TSP-RLS  II 
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Figure  5:  Simulation  results  of  AR(4).I,  II,  III,  IV,  where  the  square  root  of  the  spectral  dynamic 
range  (D)  is  also  plotted  for  comparison. 
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Figure  6:  Simulation  results  of  AR(8).I,  II,  III,  IV,  where  the  square  root  of  the  spectral  dynamic 
range  (D)  is  also  plotted  for  comparison. 
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Figure  7:  Learning  curve  of  the  FULL-RLS  and  TSP-RLS  II  after  some  initial  perturbation 
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Figure  8:  SP-RLS  I  array  with  orthogonal  preprocessing. 


(a)  (b) 


Figure  9:  Spectrum  of  (a)  the  Normal  DCT  domain  and  (b)  the  SWAP-DCT  domain. 
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(c)  (d) 


Figure  14:  (a)  Original  LENA  image,  (b)  Noisy  input  image  with  SNR=3.7  dB  (noise 
variance  =  1000).  (c)  Output  of  TDALE  with  full-size  QRD-RLS  array  (SNR=9.2  dB). 

(d)  Output  of  TDALE  with  TSP-RLS  array  (SNR=10.0  dB). 
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